Forced Expiratory Volume (FEV) is an index of pulmonary function that measures the volume of air expelled after one second of constant effort. The data contain FEV determinations on 654 children ages 6 – 22 who were seen in a childhood respiratory disease study in 1980 in East Boston, Massachusetts. The data are part of a larger study following the change in pulmonary function over time in children.
| VariableName | Category | Description |
|---|---|---|
| ID | Numeric | Unique row identifier |
| Age | Numeric | Age of the child |
| Height | Numeric | Height of the child in inches |
| Sex | Categorical | Sex of the child |
| Smoker | Categorical | Whether the child is a non-smoker or current smoker |
| FEV | Numeric | FEV of the child in litres |
Note: No citation required for this source (http://www.statsci.org/data/general/fev.html)
We chose this dataset out of personal interest: first, to see whether we can predict a child's FEV from the available predictors, rather than performing a full Pulmonary Function Test to identify any pulmonary disease; and second, to analyse statistically which predictors are responsible for increasing or decreasing a child's pulmonary function.
The aim is to predict a child's Forced Expiratory Volume (FEV) based on the child's Age, Height, Sex, and whether the child is a smoker or not. This could help kick-start the child's treatment based on the FEV reading, rather than going through a full Pulmonary Function Test.
The goal is to find, after going through several methods, the model that predicts the child's FEV with minimal error. The best model will be the one with the lowest Root Mean Square Error (RMSE) under Leave-One-Out Cross-Validation (LOOCV).
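The RMSE LOOCV used throughout can be computed without refitting the model n times. A minimal sketch, assuming a fitted `lm` object (the function name `rmse_loocv` is ours, not from the original analysis):

```r
# RMSE LOOCV via the leverage shortcut: for a linear model the
# leave-one-out residual equals resid / (1 - hat value), so no
# explicit refitting loop is needed (the PRESS identity).
rmse_loocv = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}
```

For example, `rmse_loocv(lm(FEV ~ Height, data = childfev))` scores a single-predictor model.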
We will carry out several data analyses and methods below; each subsection describes its part of the task.
childfev = read.csv("http://www.statsci.org/data/general/fev.txt",
sep = "\t",
quote = "\"",
comment.char = "",
stringsAsFactors = FALSE)
childfev$Sex = as.factor(childfev$Sex)
childfev$Smoker = as.factor(childfev$Smoker)
str(childfev)
## 'data.frame': 654 obs. of 6 variables:
## $ ID : int 301 451 501 642 901 1701 1752 1753 1901 1951 ...
## $ Age : int 9 8 7 9 9 8 6 6 8 9 ...
## $ FEV : num 1.71 1.72 1.72 1.56 1.9 ...
## $ Height: num 57 67.5 54.5 53 57 61 58 56 58.5 60 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 1 1 1 1 1 ...
## $ Smoker: Factor w/ 2 levels "Current","Non": 2 2 2 2 2 2 2 2 2 2 ...
dim(childfev)
## [1] 654 6
head(childfev$FEV,10)
## [1] 1.708 1.724 1.720 1.558 1.895 2.336 1.919 1.415 1.987 1.942
From the dataset we observe that Age and Height are numerical variables, whereas Sex and Smoker are categorical variables. FEV is the numerical response variable.
pairs(childfev[c('Age','FEV','Height','Sex','Smoker')], col = "darkgrey")
The pairs plot shows clear signs of a linear relationship between FEV and Height. However, there is no equally clear linear relationship between FEV and Age. The two categorical variables, Sex and Smoker, each show two distinct groups. Let's explore this.
Let's check the correlation matrix to explore the correlations among the numerical variables.
cor(childfev[c('Age','Height','FEV')])
## Age Height FEV
## Age 1.0000 0.7919 0.7565
## Height 0.7919 1.0000 0.8681
## FEV 0.7565 0.8681 1.0000
The correlation matrix confirms what the pairs plot suggested: Age and Height are both highly correlated with the FEV response, as well as with each other. We will need to examine the variance inflation factor (VIF) when building our models later.
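Given the high Age–Height correlation, a quick VIF check can be done with the `car` package (this snippet is a sketch, not part of the original analysis, and assumes `car` is installed):

```r
library(car)  # provides vif()

# VIFs for the additive model; values well above 5 (or 10)
# would indicate problematic multicollinearity.
vif(lm(FEV ~ Age + Height + Sex + Smoker, data = childfev))
```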
table(childfev$Sex)
##
## Female Male
## 318 336
table(childfev$Smoker)
##
## Current Non
## 65 589
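Before modelling, the two categorical predictors can also be visualised directly against FEV (an assumed exploratory step, not shown in the original):

```r
# Side-by-side boxplots of FEV by Sex and by Smoker status
par(mfrow = c(1, 2))
boxplot(FEV ~ Sex, data = childfev, col = "darkgrey",
        main = "FEV by Sex", ylab = "FEV (litres)")
boxplot(FEV ~ Smoker, data = childfev, col = "darkgrey",
        main = "FEV by Smoker", ylab = "FEV (litres)")
```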
Let's start with Multiple Linear Regression (MLR) against all the predictors (except ID), with FEV as the response.
mlr_model = lm(FEV ~ Age + Height + Sex + Smoker, data = childfev)
The multiple linear regression contains all the predictors, including dummy variables for the categorical predictors (Sex and Smoker).
Let's explore polynomial transformations of the numerical predictors to see whether we can find a model that lowers the RMSE LOOCV.
poly2_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) , data = childfev)
poly3_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3) , data = childfev)
poly4_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3) + I(Age^4) + I(Height^4), data = childfev)
poly5_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3) + I(Age^4) + I(Height^4) + I(Age^5) + I(Height^5), data = childfev)
poly6_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3) + I(Age^4) + I(Height^4) + I(Age^5) + I(Height^5) + I(Age^6) + I(Height^6), data = childfev)
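The repeated `I()` terms can be written more compactly with `poly(..., raw = TRUE)`, which fits the same model (a sketch; the name `poly6_model_alt` is ours):

```r
# Raw (non-orthogonal) polynomials reproduce the
# I(Age^k) + I(Height^k) terms up to degree 6
poly6_model_alt = lm(FEV ~ Sex + Smoker + poly(Age, 6, raw = TRUE) +
                       poly(Height, 6, raw = TRUE), data = childfev)
```

The fitted values are identical to `poly6_model`; only the coefficient labels differ.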
The following explorations are carried out for interactions between the predictors:

- Interaction between each numerical predictor (Age, Height) and each categorical predictor (Sex, Smoker), one at a time
- Interaction between the numerical predictors (Age, Height) themselves
- All two-way and three-way interactions among them
- ANOVA F tests, RMSE LOOCV, and average percentage error to find the best model
model_age_sex = lm(FEV ~ (Age + Sex) ^ 2 , data = childfev)
model_age_smoker = lm(FEV ~ (Age + Smoker) ^ 2 , data = childfev)
model_age_sex_smoker = lm(FEV ~ Age + Sex + Smoker + Age:Sex + Age:Smoker , data = childfev)
model_height_sex = lm(FEV ~ (Height + Sex) ^ 2 , data = childfev)
model_height_smoker = lm(FEV ~ (Height + Smoker) ^ 2 , data = childfev)
model_height_sex_smoker = lm(FEV ~ Height + Sex + Smoker + Height:Sex + Height:Smoker , data = childfev)
model_all_int_except_age_height = lm(FEV ~ Age + Height + Sex + Smoker + Age:Sex + Age:Smoker + Height:Sex + Height:Smoker, data = childfev)
model_all_int_with_age_height = lm(FEV ~ Age + Height + Sex + Smoker + Age:Sex + Age:Smoker + Height:Sex + Height:Smoker + Age:Height, data = childfev)
model_all_2way = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 2, data = childfev)
model_all_3way = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3, data = childfev)
Let's create a big model, with polynomials of degree 3 plus all three-way interactions (categorical-to-categorical, categorical-to-numeric, numeric-to-numeric), and see how it scores.
Also create a few models combining the previous good models:

- Polynomial degree 2 + Interaction Model 8
- Polynomial degree 2 + Interaction Model 9 (2-way interactions)
- Polynomial degree 2 + Interaction Model 10 (3-way interactions)
big_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3 + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3), data = childfev) # Big Model - Polynomial degree 3 + 3 Way Interaction
poly2_int_model8 = lm(FEV ~ Age + Height + Sex + Smoker + Age:Sex + Age:Smoker + Height:Sex + Height:Smoker + Age:Height , data = childfev) # Polynomial degree 2 + All Interaction Except Sex:Smoker
poly2_int_model9 = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 2, data = childfev) # Polynomial degree 2 + 2 Way Interaction
poly2_int_model10 = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3, data = childfev) # Polynomial degree 2 + 3 Way Interaction
Let's run backward AIC and BIC selection and see the outcome, after which we will run the ANOVA F tests.
n = length(resid(big_model))
big_model_aic = step(big_model, direction = "backward", trace = 0)
big_model_bic = step(big_model, direction = "backward", k = log(n), trace = 0)
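The criterion values of the two reduced models can be inspected directly (an assumed follow-up, not shown in the original):

```r
# extractAIC() returns c(equivalent df, criterion value);
# pass k = log(n) to get the BIC version.
extractAIC(big_model_aic)
extractAIC(big_model_bic, k = log(n))
```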
Let's compare these against the good models from the interaction and polynomial regression families, to see which ones are significant.
Let's evaluate the results derived from the above five categories of models.
As the polynomial degree increases, the RSS and RMSE decrease, as expected: the more complex the model becomes, the smaller the training error. However, the RMSE LOOCV increases beyond polynomial degree 2. It therefore does not look worthwhile to increase the model beyond degree 2, since the RMSE LOOCV only gets worse.
Let's see what happens to R2 and adjusted R2 as the polynomial degree increases. Let's also see what happens to the average percentage error.
R2 keeps increasing as the polynomial degree increases, but adjusted R2 is essentially flat at and beyond degree 2. This is further evidence in favour of the model of polynomial degree 2.
The third plot shows the average percentage error, which is calculated from the formula below.
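The formula itself is not reproduced in the text; a common definition, which we assume was used, is the mean absolute percentage error, \(\frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\). A sketch in R (the name `avg_pct_error` is ours):

```r
# Mean absolute percentage error of a fitted model on its training data
avg_pct_error = function(model, actual) {
  mean(abs((actual - fitted(model)) / actual)) * 100
}
```

For example, `avg_pct_error(poly2_model, childfev$FEV)`.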
The plot shows that the percentage error does not decrease after polynomial degree 2; it increases. This also points towards the polynomial regression model of degree 2.
Let's run an ANOVA F test to compare these models, to see whether the higher polynomial degrees are significant or not.
Each point shows the p-value from an ANOVA F test: the first compares polynomial regression of degree 1 versus degree 2, and the remaining points compare degree 2 against degrees 3, 4, 5, and 6. At alpha = 0.05, we fail to reject the null hypothesis that the additional parameters are zero for every polynomial of degree greater than 2.
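The tests plotted above can be reproduced with nested `anova()` calls (assumed code, using the models defined earlier):

```r
# Degree 1 vs degree 2, then degree 2 vs each higher degree
anova(mlr_model,   poly2_model)
anova(poly2_model, poly3_model)
anova(poly2_model, poly4_model)
anova(poly2_model, poly5_model)
anova(poly2_model, poly6_model)
```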
Let's compare the actual numbers across all the models, to confirm that the polynomial regression of degree 2 is the best model.
| ID | Interaction_Model | rss | rmse | rmse_loocv | r2 | adj_r2 |
|---|---|---|---|---|---|---|
| 1 | MLR | 110.28 | 10.501 | 0.4145 | 0.7754 | 0.7740 |
| 2 | Polynomial Degree 2 | 100.99 | 10.049 | 0.3980 | 0.7943 | 0.7924 |
| 3 | Polynomial Degree 3 | 100.59 | 10.029 | 0.3984 | 0.7951 | 0.7926 |
| 4 | Polynomial Degree 4 | 100.50 | 10.025 | 0.3996 | 0.7953 | 0.7921 |
| 5 | Polynomial Degree 5 | 100.21 | 10.010 | 0.3989 | 0.7959 | 0.7921 |
| 6 | Polynomial Degree 6 | 99.47 | 9.973 | 0.4012 | 0.7974 | 0.7929 |
Even though the ANOVA F test suggests that Model 10 (3-way) is significant, its RMSE LOOCV differs only marginally from Model 8's. We will keep these two models (Model 8 and Model 10) as the best candidates and explore them further.
It also clearly gives confidence that polynomial degree 2 is the best polynomial model, based on the criteria we have tested:

- Lowest RMSE LOOCV
- Highest R2 and adjusted R2
- Lowest average percentage error
- ANOVA F test p-values
The interactions between Age and the categorical variables (Sex and Smoker) change the estimates far more drastically than the interactions between Height and the categorical variables.
Let's conduct ANOVA F tests to verify which interactions are significant and which are not.
| ID | null_Model | full_Model | p_value | decision |
|---|---|---|---|---|
| 1 | Age+Sex+Age:Sex | Age+Sex+Smoker+Age:Sex+Age:Smoker | 0.0000 | Age+Sex+Smoker+Age:Sex+Age:Smoker |
| 2 | Age+Smoker+Age:Smoker | Age+Sex+Smoker+Age:Sex+Age:Smoker | 0.0000 | Age+Sex+Smoker+Age:Sex+Age:Smoker |
| 3 | Height+Sex+Height:Sex | Height+Sex+Smoker+Height:Sex+Height:Smoker | 0.4218 | Height+Sex+Height:Sex |
| 4 | Height+Smoker+Height:Smoker | Height+Sex+Smoker+Height:Sex+Height:Smoker | 0.0000 | Height+Sex+Smoker+Height:Sex+Height:Smoker |
| 5 | Age+Sex+Smoker+Age:Sex+Age:Smoker | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker | 0.0000 | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker |
| 6 | Height+Sex+Height:Sex | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker | 0.0000 | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker |
| 7 | Height+Sex+Smoker+Height:Sex+Height:Smoker | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker | 0.0000 | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker |
| 8 | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height | 0.0000 | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height |
| 9 | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height | 2 way | 0.8041 | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height |
| 10 | Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height | 3 way | 0.0000 | 3way |
The 3-way interaction appears significant among all the interactions between the numeric and categorical variables.
Let's see how these models perform on RSS, RMSE, and RMSE LOOCV.
As the number of interactions increases, the RSS and RMSE decrease, as expected for increasingly complex models. The RMSE LOOCV also decreases with more interactions, although it rises slightly after Model 9.
Let's see what happens to R2 and adjusted R2 as the interactions increase, along with the average percentage error.
R2 and adjusted R2 keep increasing as the interactions increase, and the error rate keeps decreasing. However, after Model 9 the error rate rises slightly as more interactions are added.
Let's compare the actual numbers across all the models, to decide on the best one.
| ID | Interaction_Model | rss | rmse | rmse_loocv | r2 | adj_r2 |
|---|---|---|---|---|---|---|
| 1 | Age+Sex+Age:Sex | 175.51 | 13.248 | 0.5214 | 0.6425 | 0.6408 |
| 2 | Age+Smoker+Age:Smoker | 199.27 | 14.116 | 0.5566 | 0.5941 | 0.5922 |
| 3 | Age+Sex+Smoker+Age:Sex+Age:Smoker | 165.53 | 12.866 | 0.5085 | 0.6628 | 0.6602 |
| 4 | Height+Sex+Height:Sex | 114.88 | 10.718 | 0.4218 | 0.7660 | 0.7649 |
| 5 | Height+Smoker+Height:Smoker | 120.32 | 10.969 | 0.4324 | 0.7549 | 0.7538 |
| 6 | Height+Sex+Smoker+Height:Sex+Height:Smoker | 114.58 | 10.704 | 0.4231 | 0.7666 | 0.7648 |
| 7 | Age+Height+Sex+Smoker+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker | 104.29 | 10.212 | 0.4059 | 0.7876 | 0.7849 |
| 8 | Age+Height+Sex+Smoker+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height | 98.15 | 9.907 | 0.3948 | 0.8001 | 0.7973 |
| 9 | 2 Way | 98.14 | 9.907 | 0.3964 | 0.8001 | 0.7970 |
| 10 | 3 Way | 93.89 | 9.690 | 0.3907 | 0.8087 | 0.8046 |
This largely confirms that the best model will come from the candidates with the lowest RMSE LOOCV; we carry these forward.
ANOVA F Test of the AIC/BIC selected models
| ID | null_Model | full_Model | p_value | decision |
|---|---|---|---|---|
| 1 | Polynomial Degree 2 | Big Model | 0.0000 | Big Model |
| 2 | All Interaction Except Sex:Smoker | Big Model | 0.0001 | Big Model |
| 3 | 2 way | Big Model | 0.0000 | Big Model |
| 4 | 3 way | Big Model | 0.2198 | 3 way |
| 5 | Polynomial Degree 2 + All Interaction Except Sex:Smoker | Big Model | 0.0001 | Big Model |
| 6 | Polynomial Degree 2 + 2 way | Big Model | 0.0000 | Big Model |
| 7 | Polynomial Degree 2 + 3 way | Big Model | 0.2198 | Polynomial Degree 2 + 3 way |
| 8 | Big Model - AIC | Big Model | 0.7976 | Big Model - AIC |
| 9 | Big Model - BIC | Big Model | 0.3877 | Big Model - BIC |
| 10 | AIC - Polynomial Degree 2 + All Interaction Except Sex:Smoker | Big Model | 0.0003 | Big Model |
| 11 | BIC - Polynomial Degree 2 + All Interaction Except Sex:Smoker | Big Model | 0.0000 | Big Model |
| 12 | AIC - Polynomial Degree 2 + 2 way | Big Model | 0.0003 | Big Model |
| 13 | BIC- Polynomial Degree 2 + 2 way | Big Model | 0.0000 | Big Model |
| 14 | AIC - Polynomial Degree 2 + 3 way | Big Model | 0.3877 | AIC - Polynomial Degree 2 + 3 way |
| 15 | BIC - Polynomial Degree 2 + 3 way | Big Model | 0.3877 | BIC - Polynomial Degree 2 + 3 way |
From the table above, we observe that the AIC/BIC reductions of the Big Model, and the Polynomial Degree 2 + 3-way interaction models, are the ones preferred: in each of those comparisons we fail to reject the smaller model against the Big Model.
Let's compute RSS, RMSE, RMSE LOOCV, R2, adjusted R2, and the average error rate to compare these models.
As model complexity increases, the RSS and RMSE decrease, as expected. The RMSE LOOCV mostly decreases as well, with slight increases for some of the more complex models.
Let's see what happens to R2 and adjusted R2 across these models, along with the average percentage error.
R2 and adjusted R2 generally increase with model complexity, and the error rate generally decreases, with slight upticks for some of the models.
Let's compare the actual numbers across all the models, to finalise the best one.
| ID | Interaction_Model | rss | rmse | rmse_loocv | adj_r2 |
|---|---|---|---|---|---|
| 1 | Polynomial Degree 2 | 100.99 | 10.049 | 0.3980 | 0.7924 |
| 2 | All Interaction Except Sex:Smoker | 98.15 | 9.907 | 0.3948 | 0.7973 |
| 3 | 2 way | 98.14 | 9.907 | 0.3964 | 0.7970 |
| 4 | 3 way | 93.89 | 9.690 | 0.3907 | 0.8046 |
| 5 | Polynomial Degree 2 + All Interaction Except Sex:Smoker | 98.15 | 9.907 | 0.3948 | 0.7973 |
| 6 | AIC - Polynomial Degree 2 + All Interaction Except Sex:Smoker | 98.24 | 9.912 | 0.3933 | 0.7977 |
| 7 | BIC - Polynomial Degree 2 + All Interaction Except Sex:Smoker | 99.94 | 9.997 | 0.3949 | 0.7949 |
| 8 | Polynomial Degree 2 + 2 way | 98.14 | 9.907 | 0.3964 | 0.7970 |
| 9 | AIC - Polynomial Degree 2 + 2 way | 98.24 | 9.912 | 0.3933 | 0.7977 |
| 10 | BIC- Polynomial Degree 2 + 2 way | 99.94 | 9.997 | 0.3949 | 0.7949 |
| 11 | Polynomial Degree 2 + 3 way | 93.89 | 9.690 | 0.3907 | 0.8046 |
| 12 | AIC - Polynomial Degree 2 + 3 way | 93.98 | 9.694 | 0.3892 | 0.8050 |
| 13 | BIC - Polynomial Degree 2 + 3 way | 93.98 | 9.694 | 0.3892 | 0.8050 |
| 14 | Big Model | 93.05 | 9.646 | 0.3925 | 0.8051 |
| 15 | AIC - Big Model | 93.29 | 9.659 | 0.3897 | 0.8058 |
| 16 | BIC - Big Model | 93.98 | 9.694 | 0.3892 | 0.8050 |
The best of the 16 models above is Model 12, with the lowest RMSE LOOCV of 0.3892, though it does not have the lowest RSS or RMSE. Let's explore the linearity assumptions of this model.
Two assumptions need to be tested as part of this section:

- Constant variance
- Normality
Let's plot the residuals-versus-fitted plot and the Q-Q plot of the model.
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
plot(resid(poly2_int_model10_aic)~fitted(poly2_int_model10_aic),
pch = 1,
cex = 1,
cex.main = 0.8,
col = "darkgrey",
xlab = "Fitted Values",
ylab = "Residuals",
main = "Fitted Vs Residual - Polynomial Degree 2 + 3 Way Interaction Model")
abline (h = 0, col = "darkorange", lwd = 2, lty = 1)
qqnorm(resid(poly2_int_model10_aic),
pch = 1,
cex = 1,
cex.main = 0.8,
col = "darkgrey")
qqline(resid(poly2_int_model10_aic),
lwd = 1,
col = "red",
lty = 2)
Both constant variance and normality are suspect. From the residual plot, the variance is small for lower fitted values and grows as the fitted values increase. The Q-Q plot also appears to have fat tails.
Let's run the Breusch-Pagan (BP) test and the Shapiro-Wilk test to validate this.
library(lmtest)  # bptest() comes from the lmtest package
bptest(poly2_int_model10_aic)
##
## studentized Breusch-Pagan test
##
## data: poly2_int_model10_aic
## BP = 88, df = 12, p-value = 1e-13
shapiro.test(resid(poly2_int_model10_aic))
##
## Shapiro-Wilk normality test
##
## data: resid(poly2_int_model10_aic)
## W = 0.99, p-value = 0.002
Both p-values are small, which confirms that constant variance and normality are suspect.
Let's check for any influential points in the model.
head(hatvalues(poly2_int_model10_aic)[hatvalues(poly2_int_model10_aic) > 2 * mean(hatvalues(poly2_int_model10_aic))],10)
## 2 23 26 59 64 104 118 191 216 222
## 0.04065 0.05486 0.05998 0.04095 0.04734 0.06494 0.04812 0.44784 0.05486 0.10747
There are around 84 observations that have high leverage.
Let's check for high standardized residuals.
head(rstandard(poly2_int_model10_aic)[abs(rstandard(poly2_int_model10_aic)) > 2],10)
## 2 53 325 332 333 368 372 422 425 442
## -3.282 2.483 2.251 -2.127 -2.073 -2.230 2.469 2.928 -2.018 2.530
There are around 30 observations that have high standardized residuals.
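The counts quoted above ("around 84" and "around 30") can be verified directly (an assumed check, complementing the `head()` previews shown):

```r
# Number of high-leverage points and of large standardized residuals
sum(hatvalues(poly2_int_model10_aic) > 2 * mean(hatvalues(poly2_int_model10_aic)))
sum(abs(rstandard(poly2_int_model10_aic)) > 2)
```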
Even though the model has high-leverage points and large standardized residuals, let's check whether these observations are influential or not.
This can be verified with Cook's distance.
n = length(resid(poly2_int_model10_aic))
length(cooks.distance(poly2_int_model10_aic)[cooks.distance(poly2_int_model10_aic) > (4/n)])
## [1] 41
It has been observed that there are 41 influential observations in the model (Cook's distance greater than 4/n).
Now let's re-fit the model after removing the influential observations, to see whether the model improves or not.
model_cd = cooks.distance(poly2_int_model10_aic)
poly2_int_model10 = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3, data = childfev, subset = model_cd <= (4/n)) # Polynomial degree 2 + 3 Way Interaction
poly2_int_model10_aic = step(poly2_int_model10, direction = "backward", trace = 0)
The model re-fit is now complete. Let's draw the diagnostic plots once again.
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
plot(resid(poly2_int_model10_aic)~fitted(poly2_int_model10_aic),
pch = 1,
cex = 1,
cex.main = 0.8,
col = "darkgrey",
xlab = "Fitted Values",
ylab = "Residuals",
main = "Fitted Vs Residual - Polynomial Degree 2 + 3 Way Interaction Model")
abline (h = 0, col = "darkorange", lwd = 2, lty = 1)
qqnorm(resid(poly2_int_model10_aic),
pch = 1,
cex = 1,
cex.main = 0.8,
col = "darkgrey")
qqline(resid(poly2_int_model10_aic),
lwd = 1,
col = "red",
lty = 2)
The plots look much better than the earlier ones. Constant variance still seems suspect, as the variance grows from low to high as the fitted values increase. However, the Q-Q plot looks good, so normality is no longer suspect.
Let's run the BP test and the Shapiro-Wilk test to confirm.
bptest(poly2_int_model10_aic)
##
## studentized Breusch-Pagan test
##
## data: poly2_int_model10_aic
## BP = 75, df = 12, p-value = 3e-11
shapiro.test(resid(poly2_int_model10_aic))
##
## Shapiro-Wilk normality test
##
## data: resid(poly2_int_model10_aic)
## W = 1, p-value = 0.6
As the plots suggested, the p-value of the BP test is very low, which confirms that constant variance is suspect, whereas the p-value of the Shapiro-Wilk test is high (greater than any usual alpha, such as 0.05 or 0.1), which confirms that normality is not suspect.
In the end we evaluated around 16 models. Out of these, we selected Model 12 as the best, based solely on the fact that it has the lowest Leave-One-Out Cross-Validation (LOOCV) RMSE. This model comes from backward AIC applied to the polynomial regression of degree 2 with 3-way interactions between the predictors.
Below is the redrawn plot of RSS, RMSE, and RMSE LOOCV, with point number (12) located for the chosen model.
Model 12's RMSE LOOCV is the lowest, while its RSS and RMSE are not. As we know, RSS and RMSE always decrease as model complexity increases, so they would always be lower for models even more complex than (12). The RMSE LOOCV, however, is a better way to validate predictive performance, so this is a good model for predicting FEV.
Below is also the replot of R2 and adjusted R2, with the circle marked on Model (12).
Let's discuss the percentage of error produced by Model 12. Before that, let's create a function that calculates the percentage of error from the formula below.
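The formula is again not reproduced in the text; below is a hedged sketch of a train/test percent-error helper, assuming a simple random split (the function name, seed, and 80/20 split are illustrative, not from the original analysis):

```r
set.seed(42)  # illustrative seed; the original seed is not given

# Fit on a random training subset and report the mean absolute
# percentage error on the held-out test rows.
train_test_pct_error = function(formula, data, train_frac = 0.8) {
  idx  = sample(nrow(data), size = floor(train_frac * nrow(data)))
  fit  = lm(formula, data = data[idx, ])
  pred = predict(fit, newdata = data[-idx, ])
  act  = model.response(model.frame(formula, data[-idx, ]))
  mean(abs((act - pred) / act)) * 100
}
```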
We can see that the percentage of error for Model 12 is around `train_test_error_final_model[12]`, which is not bad, though we used random sampling with a fixed seed; over many resamplings the percentage of error could be lower. However, since Model 12 has the lowest RMSE LOOCV, we finalise Model 12 as the final model.